Movie genres in time - a data cleaning story
I stumbled upon a movies dataset that includes film releases from as far back as 1900 up to recent years. It’s one of the biggest movie datasets on Kaggle, containing almost 35k entries. I decided to compress that data into an overview of how the popularity of movie genres changed over time.

The dataset was downloaded from Kaggle. The notebook with the code can be found on GitHub.

Exploratory analysis

After an initial look into the data I recognised two issues:

* The data is biased in terms of the movies' origins - the majority of them are from the USA, and the rest are from South East Asia and the UK only. Looking at the count of movies per origin, I decided to look at American movies only. The representation of other countries seems insufficient to call this a general overview of movie genres, but the data works fine for describing trends in the US movie industry.
* The genre column is really messy! There are over 2264 unique genres and over 1500 that occur only once in the whole dataset. Instead of searching for other, “cleaner” data, I decided to accept the challenge.

Here’s a sneak peek at what the raw values look like:

We see a lot of variability in the way genres are recorded in the dataset. Many movies have a few genres assigned, which is not unusual, but it makes grouping the genres together really hard. Some entries use dashes, slashes or just spaces to separate multiple genres.

My goal is to get an overview and a good sense of what kinds of movies were produced in the 30s and how that compares to modern trends. Therefore I don’t need a very granular description of each movie, and I will strive to clean each value down to a single word. Below I describe the cleaning steps in detail.

NLTK approach

# First I extract the group where more than one word describes the genre and keep only the nouns. My assumption is that in e.g. `short action` or `historical romance` the adjective only adds context, while the noun is the major genre of the movie.
I remove `short` and `historical` from these examples.
# Next, if more than one noun is left in the genre description, e.g. `comedy drama`, I take only the first one. This is a simplification I'm choosing to make, on the assumption that the first genre is the best fit.
# Lastly, with a single word in each genre row, I group them by their stems. E.g. I will be able to match `romance` and `romantic` based on their common base: `rom`.

The results are as follows:

* I’m down to 117 unique genre stems.
* 46 values with a single occurrence remained. Judging by a sample of them, some could have been extracted and grouped (e.g. `travel`, `family`), but most of them are not valid movie genres.

Manual approach

The above result is really good. I have to admit, though, that it was a bit complicated and required e.g. turning to operations on numpy arrays instead of a pandas DataFrame to actually be able to transform the data. Was it necessary? Scanning through the values that appear in the genre column made me think that a simple lambda function would suffice to extract the most popular genres. It seemed simple enough, as I know the scope of possible movie genres fairly well. Let's see how it compares against the NLTK approach. As the measure I will use the number of values that appear only once after cleaning. For NLTK it went down from 464 to 46.

    import re

    # Map each raw genre string to a single label; first matching rule wins.
    df['manually_cleaned'] = df['Genre'].apply(
        lambda x: 'anime' if 'anime' in x.lower()
        else 'sci-fi' if 'science fiction' in x.lower()
        else 'animation' if 'animated' in x.lower()
        else 'romance' if 'romantic' in x.lower()
        else re.split(' |-|/|,', x)[0]
        if len(re.split(' |-|/|,', x)) > 1 and re.split(' |-|/|,', x)[0] != 'sci'
        else x
    )

The above function implements the following assumptions:

* group all anime in one category (e.g.
change `anime fantasy` to simply `anime`)
* group all kinds of `sci-fi` spellings into one
* group animated movies (`animated` to `animation`)
* group romance movies (`romantic` to `romance`)
* split values like `comedy-drama` or `horror/action` and extract only the first genre mentioned. My assumption is that the first is the most descriptive in general.

And that’s it: with these 7 conditions the result is 149 unique genres, and 64 that remained ungrouped (occurred only once).

Results

NLTK approach plot:

Manual approach plot:

Summary

It looks like comedy and drama have always been at the top. Recent years, after 2010, show some increase in action, horror and thriller movies. Apart from the boldest colours, which indicate a high number of movies released, there are some lighter coloured areas that show when a genre gained some popularity - westerns had their time in the 60’s, and sci-fi movies peaked slightly at the same time too. Musicals were popular in the 20’s.

When looking at the summary of the top 20 genres for each of the approaches, there are some differences - the NLTK approach does not include the historical or social genres, while the manual approach excluded short movies and suspense. There is some loss either way. But for this particular project it doesn’t matter: the major trends were revealed with both approaches.

The conclusion from this analysis is that if you need a general overview, you care about major trends rather than single occurrences in the data, and you know the topic - the simplest approach may be good enough.

Author: Dorota Mierzwa
Category: Data analyses
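
Postscript: the nested lambda from the manual approach can be easier to follow as a named function. The sketch below restates the same rules in plain Python, together with the "values that appear only once" measure used to compare the two approaches. It is not code from the notebook: `clean_genre`, `count_singletons` and the sample values are hypothetical names and made-up data chosen for illustration.

```python
import re
from collections import Counter

# Separators the lambda splits on: space, dash, slash, comma.
SEPARATORS = re.compile(r" |-|/|,")


def clean_genre(raw: str) -> str:
    """Reduce a raw genre string to one label; the first matching rule wins."""
    lowered = raw.lower()
    if "anime" in lowered:
        return "anime"
    if "science fiction" in lowered:
        return "sci-fi"
    if "animated" in lowered:
        return "animation"
    if "romantic" in lowered:
        return "romance"
    parts = SEPARATORS.split(raw)
    # Keep "sci-fi" whole instead of cutting it down to "sci".
    if len(parts) > 1 and parts[0] != "sci":
        return parts[0]
    return raw


def count_singletons(labels) -> int:
    """Number of distinct labels that occur exactly once."""
    return sum(1 for n in Counter(labels).values() if n == 1)


# Made-up sample values, not taken from the dataset.
sample = ["comedy-drama", "anime fantasy", "sci-fi", "romantic comedy",
          "horror/action", "animated short", "comedy", "western"]
cleaned = [clean_genre(g) for g in sample]

print(cleaned)
print(count_singletons(sample), "->", count_singletons(cleaned))
```

With pandas, the same function could be applied as `df['Genre'].apply(clean_genre)`, assuming the column holds strings only.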